How we reduced our Prometheus infrastructure footprint by a third
Prometheus sharding crash course
Prometheus works by collecting metrics from services, a process referred to as scraping. To handle large amounts of data, the load can be distributed across multiple Prometheus instances, a technique called sharding. One common method of sharding is by service, where each Prometheus instance is responsible for collecting metrics from a subset of all services.
As the total number of metrics exposed by a single service grows, whether because each instance exposes more metrics or simply because the number of instances increases, it may become necessary to distribute the scraping of these metrics across multiple Prometheus instances. However, this creates challenges when evaluating recording rules: all the necessary metrics must be available in the same Prometheus instance, otherwise the generated rules will be partial.
How we shard at Criteo
The metrics of most applications at Criteo have a graphite path stored in the name label, as they are sent to Graphite after being aggregated by Prometheus. The aggregation rules are based on this name only, so we can use it as a grouping key to shard our metrics. For the service with the most metrics, we introduced this metric_relabel_configs:
metric_relabel_configs:
- source_labels: [name]
  modulus: '<number of shards>'
  action: hashmod
  target_label: __tmp_shard
- source_labels: [__tmp_shard]
  action: keep
  regex: '<shard>'
Prometheus will keep only the metrics that match this hashmod. The number of shards and the shard number are provided through environment variables.
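To see what hashmod does, here is a small Python sketch that mimics the behaviour of Prometheus's hashmod relabel action. Upstream Prometheus takes an MD5 sum of the joined source label values and reduces part of it modulo the shard count; treating the last 8 bytes as a big-endian integer is our reading of that code, so consider the details an assumption.

```python
import hashlib

def hashmod(label_value: str, modulus: int) -> int:
    # Approximation of Prometheus's hashmod relabel action: MD5 the
    # source label value and reduce it modulo the shard count.
    # (Upstream uses part of the MD5 sum as a big-endian integer --
    # treat this as an assumption, not a spec.)
    digest = hashlib.md5(label_value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus

shards = 80
names = [f"app.requests.host{i}.count" for i in range(1000)]

# Every metric name lands on exactly one shard, so the union of the
# 80 "keep" filters covers the whole metric set with no overlap.
assignments = [hashmod(n, shards) for n in names]
assert all(0 <= s < shards for s in assignments)
```

The important property is that the hash depends only on the name label, so all series that feed the same aggregation rule end up on the same shard.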
Problem solved, right? Not quite…
This worked pretty well for a while, but we eventually realized that the cost of scraping metrics for our bigger services was growing non-linearly.
For instance, the memory cost of monitoring our very biggest application:
- Early 2020: 5 instances of 50 GB (250 GB total)
- Early 2021: 30 instances of 20 GB (600 GB total)
- Early 2022: 30 instances of 35 GB (1,050 GB total)
- October 2022: 80 instances of 70 GB (5,600 GB total)
CPU cost is increasing in the same manner.
The way we implemented the sharding was obviously not scaling as well as we wanted, so we started investigating.
Identifying the problem
Let’s do some quick math on recent values of the metric dc:prometheus_samples_scraped:sum:
# Number of samples on the Prometheus side
dc:prometheus_samples_scraped:sum ≃ 7.6 billion

# Number of metrics on the service instances
average_metrics_exposed_per_instance ≃ 28k
number_of_instances = 3.4k
total_metrics_exposed = 28k × 3.4k = 95.2 million ≠ 7.6 billion
Interestingly enough, the ratio between the two amounts is almost exactly 80, which is the number of shards!
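The back-of-the-envelope calculation above can be written out explicitly (numbers taken straight from the figures quoted in this post):

```python
avg_metrics_per_instance = 28_000
instances = 3_400

total_exposed = avg_metrics_per_instance * instances  # 95.2 million
samples_scraped = 7.6e9                               # Prometheus-side count

# The scrape-side count is ~80x the exposed count: every one of the
# 80 shards decodes the full payload before dropping its share.
amplification = samples_scraped / total_exposed
print(round(amplification))  # ~80
```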
Let’s go a bit deeper, shall we? Using Prometheus’s built-in pprof endpoint, it is quite handy to get an overview of the memory or CPU usage of each part of its code with the following command:
go tool pprof -svg <prometheus url>/debug/pprof/heap > heap.svg
This generates images that we can use to identify what is using so many resources. The following is a profile of a sharded Prometheus instance:
If we zoom in on the scrape, we can see that PromParser.Metric and scrapeCache.addDropped are the biggest consumers of memory, accounting for 76% of this particular instance’s usage:
The problem is now much clearer: dropping those metrics turns out to be quite expensive. The instances have to decode every metric in the /metrics payloads, which represents a lot of data: 7.6 billion samples every minute, or about 127 million per second. Most of those metrics (98.75% for 80 shards) must then be dropped, and the result stored in the scrapeCache, which has to be huge to track all of them. Remember that each instance keeps only 1/80th of the metrics it scrapes.
From Prometheus’s point of view, it makes perfect sense to keep a cache in front of metric_relabel_configs, as it continuously scrapes mostly the same metrics from the same targets, with only the values changing.
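Putting numbers on that per-shard behaviour makes the waste obvious; this is an illustrative sketch using the figures quoted above:

```python
shards = 80
total_metrics = 95_200_000          # metrics actually exposed by the service

# Work done per shard:
parsed_per_shard = total_metrics            # each shard decodes everything...
kept_per_shard = total_metrics // shards    # ...but stores only its slice
dropped_fraction = 1 - 1 / shards           # 0.9875 for 80 shards

# The scrapeCache must remember every dropped series so it is not
# re-evaluated against the relabel rules on each scrape -- hence its size.
print(kept_per_shard, dropped_fraction)
```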
The fix
To address this issue, we had to perform the shard filtering during the scrape rather than after it. To do so, we pass the shard number and the shard count as parameters when we scrape the instances, like this:
- job_name: <sharded service>
  params:
    shard: ["<shard>"]
    shard_count: ["<shard count>"]
In the C# application code, the following filter is applied to each metric scraped:
if (_shardsCount > 1)
{
    var bytes = XxHash64.Hash(Encoding.UTF8.GetBytes(metricName));
    if (BitConverter.IsLittleEndian)
        Array.Reverse(bytes);
    return BitConverter.ToUInt32(bytes, 0) % _shardsCount == _shard;
}
The (impressive) results
When we rolled out the change, this is what happened to the total memory (in GB) and total CPU used by the sharded Prometheus instances:
This represents our whole Prometheus stack:
- 14 TB to 10 TB of memory, a net savings of 4 TB (28%)
- 1,100 to 675 physical CPUs used, a net savings of 425 CPUs (39%)
- 250 Gb/s to 25 Gb/s of incoming network traffic (local to the data center), a net savings of 225 Gb/s (90%)
Conclusion
Our investigation into the Prometheus sharding issue revealed that the root cause was the scalability limits of dropping metrics through metric_relabel_configs.
To fix this issue, we implemented a solution that filters the metrics during the scrape process rather than after, reducing the overall footprint of our Prometheus setup.
This change not only helped us to improve the efficiency and performance of our system but also allowed us to utilize our resources better and reduce waste.